Overview¶
This is an EDA notebook for understanding the House Prices dataset and finding the features most strongly correlated with the house price. Later, regression models will be used to build a predictive model for the housing price.
Data:¶
- Data comes from the Kaggle competition, URL: https://www.kaggle.com/competitions/house-prices-advanced-regression-techniques/overview
- Dataset has 79 feature columns and one target variable (SalePrice)
Goal¶
- Explore the Current Data, find patterns, trends and outliers
- Prepare data that can be used for the Regression Models
Index for the Code:¶
- Loading Data
- Cleaning Data
  - Data format checking
  - Removing Duplicates
- Numeric Data Exploration
  - Distributions
  - Outlier identification (Box Plot, Violin Plot)
  - Pairplots (To see Correlation)
  - Correlation plots with output variable = SalePrice
  - HeatMap (To get correlation in metric format)
- Categorical Data Exploration
  - Distributions (bar plot, countplot)
- Data Preparation
  - Numeric data
    - Standardization/ Normalization
    - Logarithmic transformation
  - Categorical data (One-Hot Encoding/ ..)
- Model Building
  - Multiple Linear Regression
    - Hyperparameter Tuning
  - Random Forest Regressor
    - Hyperparameter Tuning
  - XGBoost
    - Hyperparameter Tuning
- Evaluation
- Explainability
Insights and Recommendations:¶
1. Missing Values:
For the columns below, more than 45% of the values were missing. Removing the affected rows would have discarded most of the data, so the columns themselves were removed.
Alley: 93.77%
MasVnrType: 59.73%
FireplaceQu: 47.26%
PoolQC: 99.52%
Fence: 80.75%
MiscFeature: 96.30%
For the columns below, fewer than 1% of the values were missing, so the affected rows are removed to avoid unnecessary imputation skew.
MasVnrArea: 8 (0.55%)
Electrical: 1 (0.07%)
For the columns below, between 5% and 10% of the values were missing. That is too much information to discard by removing rows, so these features will be imputed instead. The column median applies to the numeric GarageYrBlt; the other garage columns are categorical, where a median is not defined, so they need a categorical fill value.
GarageType: 81 (5.55%)
GarageYrBlt: 81 (5.55%)
GarageFinish: 81 (5.55%)
GarageQual: 81 (5.55%)
GarageCond: 81 (5.55%)
2. Outliers
- Highest Number of Outliers:
- EnclosedPorch: Number of outliers = 183, Percentage of outliers = 13.68%
- MasVnrArea: Number of outliers = 82, Percentage of outliers = 6.13%
- BsmtHalfBath: Number of outliers = 80, Percentage of outliers = 5.98%
- Numeric Variable Distribution:
- The distributions of the numeric variables are skewed and do not look normal on visual inspection. Hence the numeric variables need to be transformed using a logarithmic transformation.
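A minimal sketch of the log transformation mentioned above, run on synthetic right-skewed data rather than the real dataset. It uses `np.log1p`, i.e. log(1 + x), which is an assumption made here because several columns in this dataset (for example EnclosedPorch) contain zeros that a plain logarithm cannot handle. The column names are borrowed from the dataset, but the values are generated.

```python
import numpy as np
import pandas as pd

# Toy right-skewed data; values are synthetic, not taken from train.csv.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "LotArea": rng.lognormal(mean=9.0, sigma=0.5, size=500),
    "SalePrice": rng.lognormal(mean=12.0, sigma=0.4, size=500),
})

# Skewness before the transform (0 would indicate a symmetric distribution).
skew_before = toy.skew()

# log1p computes log(1 + x), so zero-valued entries map to 0 instead of -inf.
toy_log = np.log1p(toy)
skew_after = toy_log.skew()

print(pd.DataFrame({"before": skew_before, "after": skew_after}))
```

Comparing the skewness before and after is a quick numeric check to go alongside the visual inspection of the histograms.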
Import Libraries¶
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
1. Loading Data¶
df = pd.read_csv('train.csv')
df.head(5)
| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 rows × 81 columns
print(df.shape)
print(df.info())
(1460, 81)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     588 non-null    object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None
2. Data Cleaning¶
# Check data format for individual columns
print(df.dtypes)
# Check for missing values
print(df.isnull().sum())
# Check for duplicates
duplicate_rows = df[df.duplicated()]
print("Number of duplicate rows:", duplicate_rows.shape[0])
# Remove duplicates
df = df.drop_duplicates()
print("Shape of the dataset after removing duplicates:", df.shape)
MSSubClass int64
MSZoning object
LotFrontage float64
LotArea int64
Street object
...
MoSold int64
YrSold int64
SaleType object
SaleCondition object
SalePrice int64
Length: 80, dtype: object
MSSubClass 0
MSZoning 0
LotFrontage 0
LotArea 0
Street 0
..
MoSold 0
YrSold 0
SaleType 0
SaleCondition 0
SalePrice 0
Length: 80, dtype: int64
Number of duplicate rows: 0
Shape of the dataset after removing duplicates: (1460, 80)
Null values w.r.t. column name¶
def get_null_counts_percentages(df):
    null_counts = {}
    total_rows = df.shape[0]
    for column in df.columns:
        null_count = df[column].isnull().sum()
        if null_count > 0:
            null_percentage = (null_count / total_rows) * 100
            null_counts[column] = (null_count, null_percentage)
    return null_counts
# Check for missing values
null_counts_percentages = get_null_counts_percentages(df)
print("Number and percentage of null values in each column:")
for column, (count, percentage) in null_counts_percentages.items():
    print(f"{column}: {count} ({percentage:.2f}%)")
Number and percentage of null values in each column:
Alley: 1369 (93.77%)
MasVnrType: 872 (59.73%)
MasVnrArea: 8 (0.55%)
BsmtQual: 37 (2.53%)
BsmtCond: 37 (2.53%)
BsmtExposure: 38 (2.60%)
BsmtFinType1: 37 (2.53%)
BsmtFinType2: 38 (2.60%)
Electrical: 1 (0.07%)
FireplaceQu: 690 (47.26%)
GarageType: 81 (5.55%)
GarageYrBlt: 81 (5.55%)
GarageFinish: 81 (5.55%)
GarageQual: 81 (5.55%)
GarageCond: 81 (5.55%)
PoolQC: 1453 (99.52%)
Fence: 1179 (80.75%)
MiscFeature: 1406 (96.30%)
Columns to drop for having more than 45% null values:
- Alley: 93.77%
- MasVnrType: 59.73%
- FireplaceQu: 47.26%
- PoolQC: 99.52%
- Fence: 80.75%
- MiscFeature: 96.30%
Dropping columns with more than 45% missing values¶
columns_to_remove = ['Alley', 'MasVnrType', 'FireplaceQu', 'PoolQC', 'Fence', 'MiscFeature']
df = df.drop(columns=columns_to_remove)
df.shape
(1460, 74)
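The hard-coded list above could also be derived from the 45% threshold itself, which avoids typing column names by hand. The sketch below shows one way to do that on a small synthetic frame; `drop_sparse_columns` and its `threshold` parameter are hypothetical names introduced here, not part of the notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, threshold: float = 45.0) -> pd.DataFrame:
    """Drop columns whose share of missing values exceeds `threshold` percent."""
    # isnull().mean() gives the fraction of missing values per column.
    null_pct = frame.isnull().mean() * 100
    sparse = null_pct[null_pct > threshold].index
    return frame.drop(columns=sparse)

# Synthetic example: 'Alley' is 75% missing, 'LotArea' is complete.
toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [8450, 9600, 11250, 9550],
})
cleaned = drop_sparse_columns(toy)
print(cleaned.columns.tolist())  # → ['LotArea']
```

Applied to the real data, `drop_sparse_columns(df)` should select the same six columns, provided the null percentages match the ones printed earlier.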
Keeping both DataFrames (with and without missing values) for detailed comparison¶
# Create a copy of the original DataFrame with missing values
df_with_missing = df.copy()
# Create a new DataFrame with rows containing missing values removed
df_no_missing = df.dropna()
# Print the shape of both DataFrames
print("Shape of DataFrame with missing values:", df_with_missing.shape)
print("Shape of DataFrame without missing values:", df_no_missing.shape)
Shape of DataFrame with missing values: (1460, 74)
Shape of DataFrame without missing values: (1338, 74)
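Dropping every row that has any missing value removes 122 of the 1,460 rows. The Insights section describes a gentler, tiered alternative: drop rows only for columns that are rarely missing, and impute the rest. The sketch below illustrates that idea on synthetic data. It assumes the median for numeric columns and the most frequent value for categorical ones; the latter is a substitution made here, since a median is not defined for text columns. `handle_missing` and `row_drop_max` are hypothetical names.

```python
import numpy as np
import pandas as pd

def handle_missing(frame: pd.DataFrame, row_drop_max: float = 1.0) -> pd.DataFrame:
    """Drop rows for rarely-missing columns, then impute the remaining gaps."""
    out = frame.copy()
    null_pct = out.isnull().mean() * 100

    # Tier 1: columns missing in under `row_drop_max` percent of rows -> drop those rows.
    rare = null_pct[(null_pct > 0) & (null_pct < row_drop_max)].index
    if len(rare) > 0:
        out = out.dropna(subset=list(rare))

    # Tier 2: impute whatever is still missing.
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            # Assumption: use the most frequent value for categorical columns.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out

# Synthetic frame with 200 rows; column names mirror the dataset, values do not.
n = 200
toy = pd.DataFrame({
    "Electrical": ["SBrkr"] * n,
    "GarageYrBlt": np.linspace(1950, 2010, n),
    "GarageType": ["Attchd"] * 150 + ["Detchd"] * 50,
})
toy.loc[0, "Electrical"] = np.nan                        # 0.5% missing -> row dropped
toy.loc[10:21, ["GarageYrBlt", "GarageType"]] = np.nan   # 6% missing -> imputed

result = handle_missing(toy)
print(result.shape, int(result.isnull().sum().sum()))
```

For the garage columns specifically, a missing value in this dataset may simply mean the house has no garage, so a dedicated category could be a better fill than the most frequent value; that choice is left open here.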
4. Numeric Feature Distribution¶
# Looping over the columns, as there are more than 10 numeric columns
# Select numeric columns
numeric_columns = df_with_missing.select_dtypes(include=[np.number]).columns
# Distributions
for column in numeric_columns:
    plt.figure(figsize=(8, 6))
    plt.hist(df_with_missing[column], bins=30, edgecolor='black')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.title(f'Distribution of {column}')
    plt.show()
# Outlier identification (Box Plot, Violin Plot)
for column in numeric_columns:
    plt.figure(figsize=(8, 6))
    plt.subplot(1, 2, 1)
    sns.boxplot(y=column, data=df_with_missing)
    plt.subplot(1, 2, 2)
    sns.violinplot(y=column, data=df_with_missing)
    plt.tight_layout()
    plt.show()
# Pairplots (To see Correlation)
sns.pairplot(df_with_missing[numeric_columns], diag_kind='kde')
plt.show()
# Correlation matrix across all numeric columns
correlation_matrix = df_with_missing[numeric_columns].corr()
plt.figure(figsize=(12, 10))
sns.heatmap(correlation_matrix, cmap='coolwarm', annot=True, square=True)
plt.title('Correlation Matrix')
plt.show()
# HeatMap of each numeric column's correlation with the output variable 'SalePrice'
target_column = 'SalePrice'
plt.figure(figsize=(12, 8))
sns.heatmap(df_with_missing[numeric_columns].corr()[[target_column]].sort_values(by=target_column, ascending=False),
            cmap='coolwarm', annot=True, vmin=-1, vmax=1)
plt.title('Correlation with SalePrice')
plt.show()
C:\Users\e1449629\AppData\Local\anaconda3\Lib\site-packages\seaborn\_oldcore.py:1119: FutureWarning: use_inf_as_na option is deprecated and will be removed in a future version. Convert inf values to NaN before operating instead.
with pd.option_context('mode.use_inf_as_na', True):
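The outlier counts quoted in the Insights section are not accompanied by the rule that produced them. A common choice, and the one box plots use for their whiskers by default, is the 1.5 × IQR rule. The sketch below assumes that rule and runs on synthetic data; `iqr_outlier_summary` and its `k` parameter are hypothetical names.

```python
import numpy as np
import pandas as pd

def iqr_outlier_summary(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include=[np.number])
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align on column labels; NaN compares as False, so it is not counted.
    mask = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    counts = mask.sum()
    summary = pd.DataFrame({
        "n_outliers": counts,
        "pct_outliers": counts / len(numeric) * 100,
    })
    return summary.sort_values("n_outliers", ascending=False)

# Synthetic values; only the column names come from the dataset.
toy = pd.DataFrame({
    "GrLivArea": [1000, 1100, 1200, 1300, 1400, 1500, 1600, 9000],
    "OverallQual": [5, 6, 6, 7, 7, 7, 8, 8],
})
summary = iqr_outlier_summary(toy)
print(summary)
```

In the toy frame only the 9000 entry in GrLivArea falls outside the fences, so that column reports one outlier (12.5% of rows) and OverallQual reports none. On the real data, something like `iqr_outlier_summary(df_no_missing)` would give a ranked table comparable to the figures listed in the Insights section, though the exact numbers depend on the rule the notebook actually used.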